list.of.packages <- c("tidyverse", "tsoutliers", "lubridate", "hrbrthemes", "CausalImpact", "googleAnalyticsR", "ggrepel", "ggalt", "gridExtra", "broom", "knitr", "gplots")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(list.of.packages, require, character.only = TRUE)
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
##
## [[6]]
## [1] FALSE
##
## [[7]]
## [1] TRUE
##
## [[8]]
## [1] TRUE
##
## [[9]]
## [1] TRUE
##
## [[10]]
## [1] TRUE
##
## [[11]]
## [1] TRUE
##
## [[12]]
## [1] TRUE
Objective:
Users (ga:users) and new users (ga:newUsers) by date (ga:date), for all traffic arriving through Google/CPC.
| date | users | newUsers |
|---|---|---|
| 2016-09-01 | 27 | 19 |
| 2016-09-02 | 29 | 24 |
| 2016-09-03 | 19 | 13 |
| 2016-09-04 | 28 | 23 |
| 2016-09-05 | 37 | 22 |
| 2016-09-06 | 37 | 24 |
Google Analytics has no returningUsers metric that reports the number of returning users, so we derive it as users minus newUsers.
usuarios_desde_google_adwords <- usuarios_desde_google_adwords %>%
mutate(returningUsers = users - newUsers)
knitr::kable(head(usuarios_desde_google_adwords))
| date | users | newUsers | returningUsers |
|---|---|---|---|
| 2016-09-01 | 27 | 19 | 8 |
| 2016-09-02 | 29 | 24 | 5 |
| 2016-09-03 | 19 | 13 | 6 |
| 2016-09-04 | 28 | 23 | 5 |
| 2016-09-05 | 37 | 22 | 15 |
| 2016-09-06 | 37 | 24 | 13 |
We load the file with the daily spend on keyword bids in Google AdWords.
| date | adCost |
|---|---|
| 2016-09-01 | 152.06 |
| 2016-09-02 | 161.67 |
| 2016-09-03 | 127.90 |
| 2016-09-04 | 127.92 |
| 2016-09-05 | 153.58 |
| 2016-09-06 | 162.66 |
| date | users | newUsers | returningUsers |
|---|---|---|---|
| 2016-09-01 | 27 | 19 | 8 |
| 2016-09-02 | 29 | 24 | 5 |
| date | adCost |
|---|---|
| 2016-09-01 | 152.06 |
| 2016-09-02 | 161.67 |
usuarios_costes_google_adwords <- left_join(usuarios_desde_google_adwords,
costes_google_adwords, by="date")
| date | users | newUsers | returningUsers | adCost |
|---|---|---|---|---|
| 2016-09-01 | 27 | 19 | 8 | 152.06 |
| 2016-09-02 | 29 | 24 | 5 | 161.67 |
## date users newUsers returningUsers
## Min. :2016-09-01 Min. : 1.00 Min. : 0.00 Min. : 1.00
## 1st Qu.:2016-10-17 1st Qu.: 29.25 1st Qu.:20.00 1st Qu.: 8.25
## Median :2016-12-02 Median : 42.50 Median :27.50 Median :13.00
## Mean :2016-12-02 Mean : 50.38 Mean :36.13 Mean :14.25
## 3rd Qu.:2017-01-17 3rd Qu.: 71.00 3rd Qu.:51.75 3rd Qu.:19.00
## Max. :2017-03-05 Max. :132.00 Max. :98.00 Max. :34.00
## adCost
## Min. : 0.00
## 1st Qu.: 73.57
## Median :169.50
## Mean :187.09
## 3rd Qu.:280.25
## Max. :441.57
Every univariate time series analysis starts with a plot showing how the variable evolves over time.
Are the two series similar?
And now?
plot(density(usuarios_costes_google_adwords$returningUsers),
main="Returning users")
boxplot(usuarios_costes_google_adwords$returningUsers,
main="Returning users")
plot(usuarios_costes_google_adwords$adCost, usuarios_costes_google_adwords$newUsers,
main = "Acquisition and ad spend")
Is there a relationship between ad spend and the number of new users acquired?
How strong is the relationship between the two variables?
Which campaign contributes most to acquiring new users?
How precisely can we estimate the effect of each campaign?
How confidently can we predict the number of new users in the future?
Is the relationship linear?
How does spend influence acquisition?
usuarios_costes_google_adwords %>%
ggplot(aes(x = adCost, y = newUsers)) +
geom_point(color = "orange", size = 3, alpha = 0.8)
usuarios_costes_google_adwords %>%
ggplot(aes(x = adCost, y = newUsers)) +
geom_point(color = "orange", size = 3, alpha = 0.8) +
geom_smooth()
usuarios_costes_google_adwords %>%
ggplot(aes(x = adCost, y = newUsers)) +
geom_point(color = "orange", size = 3, alpha = 0.8) +
geom_smooth(method="lm")
Which variable can we act on to cause a change in the other?
We call:
y our dependent (response) variable, newUsers
x our independent (explanatory) variable, adCost
y <- usuarios_costes_google_adwords$newUsers
x <- usuarios_costes_google_adwords$adCost
y[1:5]
## [1] 19 24 13 23 22
x[1:5]
## [1] 152.06 161.67 127.90 127.92 153.58
How many new users do we expect to acquire if we ignore ad spend?
\[\hat y = \bar y\]
mean(y)
## [1] 36.12903
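As a hedged aside, predicting with the mean is the same as fitting a regression that has only an intercept. The sketch below uses just the first five newUsers values printed earlier; the names y5 and baseline are introduced here for illustration only.

```r
# First five newUsers values shown earlier (illustrative subset, not the full series).
y5 <- c(19, 24, 13, 23, 22)

# An intercept-only linear model predicts one constant value for every day.
baseline <- lm(y5 ~ 1)

# Its single coefficient coincides with the sample mean.
baseline_coef <- unname(coef(baseline))
print(baseline_coef)
print(mean(y5))
```

Any model that uses adCost should beat this baseline to be worth keeping.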
\[\hat y = f(x)\]
\[newUsers = f(adCost)\]
\[\hat y = \beta_0 + \beta_1(x)\]
\[ -1 \le r \le 1 \]
cor(x,y)
## [1] 0.918814
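To make the coefficient less of a black box, here is a minimal sketch that computes Pearson's r by hand and compares it with cor(). It assumes only the first five (adCost, newUsers) pairs printed earlier; x5, y5, dx and dy are illustrative names, and the value obtained on five points will differ from the full-sample 0.918814.

```r
# First five observations printed earlier (illustrative subset).
x5 <- c(152.06, 161.67, 127.90, 127.92, 153.58)  # adCost
y5 <- c(19, 24, 13, 23, 22)                      # newUsers

# Deviations from the mean.
dx <- x5 - mean(x5)
dy <- y5 - mean(y5)

# Pearson's r: co-variation divided by the product of the individual variations.
r_manual  <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_builtin <- cor(x5, y5)

print(c(manual = r_manual, builtin = r_builtin))
```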
Is the correlation between \(x\) and \(y\) statistically significant?
(ct <- cor.test(x, y))
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 31.578, df = 184, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8929860 0.9386106
## sample estimates:
## cor
## 0.918814
The p-value is below .05, so we can reject \(H_0\) (no correlation in the population) and conclude that the correlation is not 0. The 95% confidence interval, [0.893, 0.939], lies far from 0 and close to 1, which indicates a strong positive correlation.
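The t statistic reported by cor.test() can be recovered from r and the degrees of freedom alone. The following sketch plugs in the values printed in the output above (r = 0.918814, df = 184); the variable names are illustrative.

```r
# Values taken from the cor.test() output above.
r  <- 0.918814
df <- 184            # n - 2, so n = 186 daily observations

# Test statistic for H0: true correlation equals 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df)

print(t_stat)
print(p_value)
```

The result should agree with the reported t = 31.578 up to rounding of r.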
How much do I need to invest to acquire one new user?
| newUsers | adCost |
|---|---|
| 19 | 152 |
| 24 | 162 |
| 13 | 128 |
As a first instinct, the CPA, or average price paid per new user, would be \(y = f(x)\), \(cpa = f(newUsers)\). Therefore \(y = \beta x\).
From the data above: \( (152 + 162 + 128) / (19 + 24 + 13) \)
(cpa <- sum(x[1:3]) / sum(y[1:3]) )
## [1] 7.88625
Cost of acquiring one user: \(CPA = 7.89\) euros per user.
Our final model: \(Users = \frac{1}{7.89} \times Spend\)
This is a very simple predictive model. It takes an input value (spend in euros), applies a function (1/7.89 * spend), and returns a result (users).
Its technical name: regression model.
## [1] 7.782076
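The proportional model Users = Spend / CPA described above fits in a one-line function. This is only a sketch: predict_users is a hypothetical helper name, and the CPA of 7.89 is the three-day estimate from the text, not a fitted parameter.

```r
# CPA estimated in the text from the first three days (euros per new user).
cpa <- 7.89

# Hypothetical helper: expected new users for a given spend, with no intercept.
predict_users <- function(spend, cpa) {
  spend / cpa
}

# Expected new users for two example budgets.
print(predict_users(c(50, 320), cpa))
```

Because the line is forced through the origin, zero spend always predicts zero new users, which the fitted model below does not assume.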
new_users.lm <- lm(newUsers ~ adCost, data = usuarios_costes_google_adwords)
summary(new_users.lm)
##
## Call:
## lm(formula = newUsers ~ adCost, data = usuarios_costes_google_adwords)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.6322 -5.8271 -0.2376 5.8206 25.7548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.81243 1.20903 3.153 0.00189 **
## adCost 0.17274 0.00547 31.578 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.779 on 184 degrees of freedom
## Multiple R-squared: 0.8442, Adjusted R-squared: 0.8434
## F-statistic: 997.1 on 1 and 184 DF, p-value: < 2.2e-16
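For a single predictor, the least-squares coefficients have a closed form, and R-squared equals the squared correlation (here 0.918814^2 is about 0.8442, matching the summary). The sketch below checks both identities on the first five observations printed earlier; x5, y5 and fit5 are illustrative names, so the numbers differ from the full model.

```r
# First five observations printed earlier (illustrative subset).
x5 <- c(152.06, 161.67, 127.90, 127.92, 153.58)  # adCost
y5 <- c(19, 24, 13, 23, 22)                      # newUsers

# Closed-form simple regression: slope, then intercept.
b1 <- cov(x5, y5) / var(x5)
b0 <- mean(y5) - b1 * mean(x5)

# Same fit through lm().
fit5 <- lm(y5 ~ x5)
print(rbind(closed_form = c(b0, b1), lm = unname(coef(fit5))))

# With one predictor, R-squared is the squared Pearson correlation.
r2_from_r  <- cor(x5, y5)^2
r2_from_lm <- summary(fit5)$r.squared
print(c(r2_from_r, r2_from_lm))
```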
The model coefficients
(coeficientes <- coef(new_users.lm))
## (Intercept) adCost
## 3.812432 0.172736
The intercept?
\[\hat y = \beta_0 + \beta_1 x\]
With predict we obtain the point on the fitted line for each value of the independent variable (adCost)
# Obtain predicted and residual values
usuarios_costes_google_adwords$predicted <- predict(new_users.lm)
usuarios_costes_google_adwords$residuals <- residuals(new_users.lm)
| adCost | newUsers | predicted | residuals |
|---|---|---|---|
| 152.06 | 19 | 30.08 | -11.08 |
| 161.67 | 24 | 31.74 | -7.74 |
| 127.90 | 13 | 25.91 | -12.91 |
\[MSE=\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2\]
\[RMSE=\sqrt{MSE}\]
(mean((usuarios_costes_google_adwords$newUsers - usuarios_costes_google_adwords$predicted)^2))^0.5
## [1] 8.7321
summary(new_users.lm)$sigma
## [1] 8.779429
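The two numbers above differ slightly because the RMSE divides the squared errors by n, whereas the residual standard error divides by the residual degrees of freedom, n - 2. A short sketch, assuming n = 186 (184 residual degrees of freedom plus two estimated coefficients), reconciles them:

```r
# Values printed above.
rmse <- 8.7321
n    <- 186   # 184 residual df + 2 coefficients (intercept and slope)
p    <- 2

# Rescale from a divide-by-n to a divide-by-(n - p) estimate.
sigma_hat <- rmse * sqrt(n / (n - p))
print(sigma_hat)
```

The result should match summary(new_users.lm)$sigma up to the rounding of the printed RMSE.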
confint(new_users.lm)
## 2.5 % 97.5 %
## (Intercept) 1.4270902 6.1977737
## adCost 0.1619436 0.1835284
posibles_inversiones <- data.frame(adCost=c(50, 320))
| adCost |
|---|
| 50 |
| 320 |
predict() now with new data
predicciones <- as.data.frame(
predict(object = new_users.lm,
newdata = posibles_inversiones ,
interval='prediction')
)
predicciones <- cbind(posibles_inversiones,predicciones)
| adCost | fit | lwr | upr |
|---|---|---|---|
| 50 | 12.44923 | -4.981462 | 29.87993 |
| 320 | 59.08795 | 41.661020 | 76.51488 |
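The fit column can be reproduced by hand from the coefficients reported earlier. The sketch below does that and also shows one possible way to handle the negative lower bound; truncating at zero is an assumption added here for presentation, not something the original analysis does.

```r
# Coefficients reported by coef(new_users.lm).
b0 <- 3.812432
b1 <- 0.172736

# Point predictions for the two candidate budgets.
ad_cost <- c(50, 320)
fit <- b0 + b1 * ad_cost
print(fit)

# Lower prediction bounds from the table above.
lwr <- c(-4.981462, 41.661020)

# Assumption: a count of users cannot be negative, so report max(lwr, 0).
lwr_truncated <- pmax(lwr, 0)
print(lwr_truncated)
```

A negative bound is a reminder that a linear model with constant error variance is only an approximation for count data at low spend.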
Fit a linear regression that explains the influence of ad spend on returning users. Compare the results with the previous model.
old.par <- par(mfrow=c(1, 1))
par(mfrow=c(1, 2))
plot(density(usuarios_desde_google_adwords$newUsers), main="newUsers")
plot(density(usuarios_desde_google_adwords$returningUsers) , main = "returningUsers")
par(old.par)
Can we compare the two distributions using the plots above?
par(mfrow=c(1, 2))
boxplot(usuarios_desde_google_adwords$newUsers)
boxplot(usuarios_desde_google_adwords$returningUsers)
par(old.par)
##
## Welch Two Sample t-test
##
## data: usuarios_desde_google_adwords$newUsers and usuarios_desde_google_adwords$returningUsers
## t = 12.735, df = 227.13, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 18.49135 25.26133
## sample estimates:
## mean of x mean of y
## 36.12903 14.25269
The p-value is below 0.05, so we reject the null hypothesis \(H_0\) that the means are equal.
We build the time series with the ts() function, although there are other options.
new_users.ts <- ts(usuarios_desde_google_adwords$newUsers,
frequency = 365,
start = decimal_date( ymd( min( usuarios_desde_google_adwords$date ) ) ) )
## Time Series:
## Start = 2016.66666666667
## End = 2017.17351598174
## Frequency = 365
## [1] 19 24 13 23 22 24 24 25 27 21 14 23 27 37 23 21 17 22 24 31 20 22 26
## [24] 18 23 30 26 35 25 15 18 26 25 30 21 26 20 17 15 39 38 26 43 32 23 23
## [47] 49 58 38 61 54 36 34 38 35 52 40 35 22 20 22 29 37 44 43 42 32 56 39
## [70] 40 53 39 28 22 46 39 48 26 41 27 28 50 42 26 27 25 20 13 25 38 31 21
## [93] 22 12 13 20 15 22 8 10 15 7 16 26 19 25 22 9 20 23 19 18 20 14 6
## [116] 5 11 13 11 17 15 12 13 16 27 17 22 9 10 14 27 65 38 0 0 0 0 41
## [139] 31 51 47 43 36 37 75 64 78 68 62 55 67 75 78 83 98 58 65 70 97 78 86
## [162] 70 63 48 59 89 72 73 73 57 41 55 68 74 65 67 73 70 56 73 59 56 98 72
## [185] 62 67
new_users.ts.fit <- tsoutliers::tso(new_users.ts,
types = c("AO","LS","TC","IO"),
maxit.iloop=20,
tsmethod = "auto.arima")
new_users.ts.fit
## Series: new_users.ts
## ARIMA(1,1,1)
##
## Coefficients:
## ar1 ma1 IO132 LS138 LS145 AO183
## 0.3506 -0.8884 27.5391 31.1767 30.7773 33.4055
## s.e. 0.0847 0.0371 7.1354 6.7581 6.6955 8.6727
##
## sigma^2 estimated as 94.79: log likelihood=-680.96
## AIC=1375.91 AICc=1376.55 BIC=1398.46
##
## Outliers:
## type ind time coefhat tstat
## 1 IO 132 2017:10 27.54 3.859
## 2 LS 138 2017:16 31.18 4.613
## 3 LS 145 2017:23 30.78 4.597
## 4 AO 183 2017:61 33.41 3.852
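The ind column is a position in the series, so it helps to translate it back into calendar dates. The sketch below assumes one observation per day with no gaps, starting on 2016-09-01 (consistent with 186 observations ending on 2017-03-05); the outliers data frame is an illustrative copy of the table above.

```r
# First day of the series.
start_date <- as.Date("2016-09-01")

# Outlier types and positions as reported by tso().
outliers <- data.frame(
  type = c("IO", "LS", "LS", "AO"),
  ind  = c(132, 138, 145, 183),
  stringsAsFactors = FALSE
)

# Position 1 is the start date, so position k is start_date + (k - 1) days.
outliers$date <- start_date + (outliers$ind - 1)
print(outliers)
```

Under that assumption the second level shift (index 145) falls on 2017-01-23, the last day of the pre-intervention period used in the CausalImpact call in this document.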
plot(new_users.ts.fit)
Did this change have any impact on our acquisition of new users?
pre.period <- as.Date(c("2016-09-01", "2017-01-23"))
post.period <- as.Date(c("2017-01-24", "2017-03-05"))
impact <- CausalImpact(usuarios_desde_google_adwords %>% dplyr::select(date, newUsers),
pre.period,
post.period)
Can we quantify it?
## Analysis report {CausalImpact}
##
##
## During the post-intervention period, the response variable had an average value of approx. 69.32. By contrast, in the absence of an intervention, we would have expected an average response of 26.24. The 95% interval of this counterfactual prediction is [20.83, 31.04]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is 43.08 with a 95% interval of [38.28, 48.49]. For a discussion of the significance of this effect, see below.
##
## Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 2.84K. By contrast, had the intervention not taken place, we would have expected a sum of 1.08K. The 95% interval of this prediction is [0.85K, 1.27K].
##
## The above results are given in terms of absolute numbers. In relative terms, the response variable showed an increase of +164%. The 95% interval of this percentage is [+146%, +185%].
##
## This means that the positive effect observed during the intervention period is statistically significant and unlikely to be due to random fluctuations. It should be noted, however, that the question of whether this increase also bears substantive significance can only be answered by comparing the absolute effect (43.08) to the original goal of the underlying intervention.
##
## The probability of obtaining this effect by chance is very small (Bayesian one-sided tail-area probability p = 0.001). This means the causal effect can be considered statistically significant.
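The headline figures in the report are linked by simple arithmetic, which is worth checking once. The sketch below uses only the averages quoted in the report and the post-period dates from the code above; all variable names are illustrative.

```r
# Averages quoted in the report.
observed_avg <- 69.32   # actual daily new users after the intervention
expected_avg <- 26.24   # counterfactual prediction without the intervention

# Absolute and relative effect.
abs_effect     <- observed_avg - expected_avg
rel_effect_pct <- 100 * abs_effect / expected_avg

# Length of the post-intervention period, both end dates included.
n_days <- as.numeric(as.Date("2017-03-05") - as.Date("2017-01-24")) + 1

# Cumulative totals over that period.
observed_total <- observed_avg * n_days
expected_total <- expected_avg * n_days

print(c(abs_effect = abs_effect, rel_effect_pct = rel_effect_pct, n_days = n_days))
print(c(observed_total = observed_total, expected_total = expected_total))
```

The totals correspond to the 2.84K and 1.08K reported, and the relative effect to the +164%.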
ga:users, ga:newUsers and ga:bounces by date and campaign
ga_auth()
google_adwords_metricas <- google_analytics(id = "46728973",
start="2016-09-01", end="2017-03-05",
metrics = c("ga:users", "ga:newUsers", "ga:bounces"),
dimensions = c("ga:date", "ga:campaign"),
segment = c("sessions::condition::ga:sourceMedium=@google;ga:medium=@cpc"),
sort = c("ga:date"),
max=99999999,
samplingLevel="HIGHER_PRECISION"
)
knitr::kable(head(google_adwords_metricas))
| date | campaign | users | newUsers | bounces |
|---|---|---|---|---|
| 2016-09-01 | (not set) | 2 | 1 | 1 |
| 2016-09-01 | 0 - MARCA | 5 | 2 | 4 |
| 2016-09-01 | ALI - MASTER - MARKETING ONLINE | 1 | 0 | 1 |
| 2016-09-01 | BCN - MASTER - ANALITICA WEB | 1 | 1 | 1 |
| 2016-09-01 | BCN - MASTER - UX | 1 | 0 | 0 |
| 2016-09-01 | MAD - CURSO - ESTADISTICA | 1 | 1 | 0 |
costes_por_campana <- read_csv('data/costes_por_campana.csv')
knitr::kable(head(costes_por_campana))
| date | campaign | impressions | adClicks | adCost | cpc |
|---|---|---|---|---|---|
| 2016-09-01 | 0 - MARCA | 18 | 4 | 0.82 | 0.205 |
| 2016-09-01 | 0 - MARCA BCN | 6 | 0 | 0.00 | 0.000 |
| 2016-09-01 | ALI - MASTER - MARKETING ONLINE | 10 | 1 | 5.91 | 5.910 |
| 2016-09-01 | AND - MASTER - ANALITICA WEB | 12 | 0 | 0.00 | 0.000 |
| 2016-09-01 | AND - MASTER - MARKETING ONLINE | 10 | 0 | 0.00 | 0.000 |
| 2016-09-01 | AND - MASTER - SEO y SEM | 36 | 0 | 0.00 | 0.000 |
Does anything in the data stand out?
summary(google_adwords_metricas)
## date campaign users newUsers
## Min. :2016-09-01 Length:2546 Min. : 1.000 Min. : 0.000
## 1st Qu.:2016-10-08 Class :character 1st Qu.: 1.000 1st Qu.: 1.000
## Median :2016-11-19 Mode :character Median : 2.000 Median : 1.000
## Mean :2016-12-01 Mean : 3.704 Mean : 2.639
## 3rd Qu.:2017-02-01 3rd Qu.: 3.000 3rd Qu.: 3.000
## Max. :2017-03-05 Max. :49.000 Max. :29.000
## bounces
## Min. : 0.000
## 1st Qu.: 1.000
## Median : 1.000
## Mean : 2.178
## 3rd Qu.: 2.000
## Max. :21.000
summary(costes_por_campana)
## date campaign impressions
## Min. :2016-09-01 Length:3538 Min. : 1.00
## 1st Qu.:2016-10-03 Class :character 1st Qu.: 11.00
## Median :2016-11-15 Mode :character Median : 23.00
## Mean :2016-11-28 Mean : 73.81
## 3rd Qu.:2017-01-29 3rd Qu.: 49.75
## Max. :2017-03-05 Max. :4714.00
## adClicks adCost cpc
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 1.000 Median : 5.285 Median : 2.440
## Mean : 2.781 Mean : 9.836 Mean : 4.368
## 3rd Qu.: 3.000 3rd Qu.:14.360 3rd Qu.: 6.867
## Max. :53.000 Max. :95.860 Max. :76.070
campanas <- left_join(google_adwords_metricas, costes_por_campana, by = c("date", "campaign"))
We can already see measurement errors:
| date | campaign | users | newUsers | bounces | impressions | adClicks | adCost | cpc |
|---|---|---|---|---|---|---|---|---|
| 2016-09-01 | (not set) | 2 | 1 | 1 | NA | NA | NA | NA |
| 2016-09-01 | 0 - MARCA | 5 | 2 | 4 | 18 | 4 | 0.82 | 0.205 |
| 2016-09-01 | ALI - MASTER - MARKETING ONLINE | 1 | 0 | 1 | 10 | 1 | 5.91 | 5.910 |
| 2016-09-01 | BCN - MASTER - ANALITICA WEB | 1 | 1 | 1 | 8 | 1 | 17.94 | 17.940 |
| 2016-09-01 | BCN - MASTER - UX | 1 | 0 | 0 | 14 | 2 | 5.75 | 2.875 |
| 2016-09-01 | MAD - CURSO - ESTADISTICA | 1 | 1 | 0 | 19 | 1 | 0.60 | 0.600 |
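The NA values in the (not set) row are what a left join produces when a key on the left has no partner on the right. The sketch below reproduces that with two rows from the tables above and lists the unmatched keys. It swaps in base R's merge(all.x = TRUE) as a stand-in for left_join so that it runs without extra packages; metrics, costs and unmatched are illustrative names.

```r
# Two rows from the Analytics metrics shown above.
metrics <- data.frame(
  date     = c("2016-09-01", "2016-09-01"),
  campaign = c("(not set)", "0 - MARCA"),
  users    = c(2, 5),
  stringsAsFactors = FALSE
)

# One matching row from the cost file; "(not set)" has no cost record.
costs <- data.frame(
  date     = "2016-09-01",
  campaign = "0 - MARCA",
  adCost   = 0.82,
  stringsAsFactors = FALSE
)

# Keep every row of metrics; unmatched rows get NA in the cost columns.
joined <- merge(metrics, costs, by = c("date", "campaign"), all.x = TRUE)

# Keys that found no partner in the cost data.
unmatched <- joined[is.na(joined$adCost), c("date", "campaign")]
print(unmatched)
```

Listing the unmatched keys before dropping them makes it easier to tell tracking gaps from naming mismatches between the two sources.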
campanas <- campanas %>%
filter(!grepl("MARCA|Competencia|\\(not set\\)", campaign)) %>%
separate(campaign, c("ciudad", "programa", "nombre"), " - ") %>%
mutate(ctr = round(adClicks / impressions,2))
| date | ciudad | programa | nombre | users | newUsers | bounces | impressions | adClicks | adCost | cpc | ctr |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-09-01 | ALI | MASTER | MARKETING ONLINE | 1 | 0 | 1 | 10 | 1 | 5.91 | 5.910 | 0.10 |
| 2016-09-01 | BCN | MASTER | ANALITICA WEB | 1 | 1 | 1 | 8 | 1 | 17.94 | 17.940 | 0.12 |
| 2016-09-01 | BCN | MASTER | UX | 1 | 0 | 0 | 14 | 2 | 5.75 | 2.875 | 0.14 |
| 2016-09-01 | MAD | CURSO | ESTADISTICA | 1 | 1 | 0 | 19 | 1 | 0.60 | 0.600 | 0.05 |
| 2016-09-01 | MAD | CURSO | OPEN STACK | 1 | 1 | 1 | 30 | 1 | 0.57 | 0.570 | 0.03 |
| 2016-09-01 | MAD | CURSO | PROGRAMACION | 1 | 1 | 1 | 19 | 1 | 1.14 | 1.140 | 0.05 |
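Because grepl() treats its pattern as a regular expression, parentheses act as grouping operators unless they are escaped, so matching the literal label "(not set)" needs \\( and \\). The sketch below shows the difference and the " - " split in base R; the last campaign name is hypothetical, invented to expose the edge case.

```r
# Three campaign labels from the data plus one hypothetical label.
campaigns <- c("(not set)", "0 - MARCA", "BCN - MASTER - UX",
               "MAD - CURSO - do not set")  # hypothetical

# Unescaped: the group matches the bare text "not set" anywhere.
loose <- grepl("(not set)", campaigns)

# Escaped: only the literal "(not set)" with its parentheses matches.
strict <- grepl("\\(not set\\)", campaigns)

print(data.frame(campaigns, loose, strict))

# Splitting a label into city, programme type and course name.
parts <- strsplit("BCN - MASTER - UX", " - ", fixed = TRUE)[[1]]
print(parts)
```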
We check for missing values
## date ciudad programa
## Min. :2016-09-01 Length:2084 Length:2084
## 1st Qu.:2016-10-04 Class :character Class :character
## Median :2016-11-14 Mode :character Mode :character
## Mean :2016-11-30
## 3rd Qu.:2017-02-03
## Max. :2017-03-05
##
## nombre users newUsers bounces
## Length:2084 Min. : 1.000 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
## Mode :character Median : 1.000 Median : 1.000 Median : 1.000
## Mean : 2.275 Mean : 1.917 Mean : 1.718
## 3rd Qu.: 3.000 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :21.000 Max. :21.000 Max. :21.000
##
## impressions adClicks adCost cpc
## Min. : 1.00 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 14.00 1st Qu.: 1.000 1st Qu.: 4.66 1st Qu.: 2.800
## Median : 28.00 Median : 2.000 Median : 9.54 Median : 5.463
## Mean : 50.94 Mean : 2.437 Mean :14.20 Mean : 6.838
## 3rd Qu.: 55.00 3rd Qu.: 3.000 3rd Qu.:19.09 3rd Qu.: 9.190
## Max. :1003.00 Max. :25.000 Max. :95.86 Max. :76.070
## NA's :15 NA's :15 NA's :15 NA's :15
## ctr
## Min. :0.0000
## 1st Qu.:0.0400
## Median :0.0600
## Mean :0.0979
## 3rd Qu.:0.1100
## Max. :1.5000
## NA's :15
These NAs come from rows that have no matching data in the costes_por_campana dataset.
campanas <- campanas %>% na.omit
## date ciudad programa
## Min. :2016-09-01 Length:2069 Length:2069
## 1st Qu.:2016-10-04 Class :character Class :character
## Median :2016-11-13 Mode :character Mode :character
## Mean :2016-11-30
## 3rd Qu.:2017-02-03
## Max. :2017-03-05
## nombre users newUsers bounces
## Length:2069 Min. : 1.000 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000
## Mode :character Median : 2.000 Median : 1.000 Median : 1.000
## Mean : 2.284 Mean : 1.931 Mean : 1.727
## 3rd Qu.: 3.000 3rd Qu.: 2.000 3rd Qu.: 2.000
## Max. :21.000 Max. :21.000 Max. :21.000
## impressions adClicks adCost cpc
## Min. : 1.00 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 14.00 1st Qu.: 1.000 1st Qu.: 4.66 1st Qu.: 2.800
## Median : 28.00 Median : 2.000 Median : 9.54 Median : 5.463
## Mean : 50.94 Mean : 2.437 Mean :14.20 Mean : 6.838
## 3rd Qu.: 55.00 3rd Qu.: 3.000 3rd Qu.:19.09 3rd Qu.: 9.190
## Max. :1003.00 Max. :25.000 Max. :95.86 Max. :76.070
## ctr
## Min. :0.0000
## 1st Qu.:0.0400
## Median :0.0600
## Mean :0.0979
## 3rd Qu.:0.1100
## Max. :1.5000
##
## Call:
## lm(formula = newUsers ~ 0 + adCost + nombre, data = campanas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9158 -0.6138 -0.0131 0.4312 7.1890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## adCost 0.060527 0.001963 30.836 < 2e-16 ***
## nombreANALITICA WEB 0.389573 0.068722 5.669 1.64e-08 ***
## nombreAPACHE FLINK 0.000000 0.810518 0.000 1.00000
## nombreBIG DATA 0.508654 0.094988 5.355 9.52e-08 ***
## nombreCRO IGNITE 0.496703 0.199542 2.489 0.01288 *
## nombreDATA SCIENCE 0.697965 0.088276 7.907 4.28e-15 ***
## nombreESTADISTICA 1.509175 0.145675 10.360 < 2e-16 ***
## nombreHADOOP & SPARK 0.434026 0.810521 0.535 0.59237
## nombreINBOUND MK 1.197230 0.306439 3.907 9.65e-05 ***
## nombreINTRO BLOCKCHAIN 13.423692 0.190186 70.582 < 2e-16 ***
## nombreMARKETING ONLINE 0.352708 0.070150 5.028 5.39e-07 ***
## nombreMK CONTENIDO PERIODISTAS 1.054212 0.234271 4.500 7.18e-06 ***
## nombreOPEN STACK 1.052198 0.382086 2.754 0.00594 **
## nombrePROGRAMACION 1.259516 0.197112 6.390 2.05e-10 ***
## nombreR 0.915319 0.179024 5.113 3.47e-07 ***
## nombreSEO y SEM 1.181027 0.065176 18.121 < 2e-16 ***
## nombreSOLUCIONES NEGOCIO 0.901148 0.162135 5.558 3.08e-08 ***
## nombreUX 1.537406 0.068685 22.384 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.146 on 2051 degrees of freedom
## Multiple R-squared: 0.849, Adjusted R-squared: 0.8476
## F-statistic: 640.5 on 18 and 2051 DF, p-value: < 2.2e-16
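The formula newUsers ~ 0 + adCost + nombre drops the global intercept, so each course name gets its own intercept instead of being compared with a reference level. The sketch below illustrates this on a tiny synthetic data set (g, x and y are made up for the example). It also shows why the R-squared of a no-intercept model is not directly comparable with that of a model with intercept: R measures it against zero rather than against the mean.

```r
# Synthetic data: three groups, one numeric predictor (illustrative only).
g <- factor(c("A", "A", "B", "B", "C", "C"))
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 2.9, 5.2, 6.1, 9.8, 11.1)

# With intercept: level A is the reference, other levels are differences.
fit_with <- lm(y ~ x + g)

# Without intercept: one coefficient per level, read as that level's intercept.
fit_without <- lm(y ~ 0 + x + g)

print(coef(fit_with))
print(coef(fit_without))

# Both parameterisations give the same fitted values.
same_fit <- all.equal(fitted(fit_with), fitted(fit_without))
print(same_fit)

# R-squared differs only because the reference total sum of squares changes.
r2_with    <- summary(fit_with)$r.squared
r2_without <- summary(fit_without)$r.squared
print(c(with_intercept = r2_with, without_intercept = r2_without))
```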
plotmeans(newUsers ~ ciudad, data=campanas, digits=2, ccol='red', mean.labels=T, main="Mean new users by city")
## Warning in arrows(x, li, x, pmax(y - gap, li), col = barcol, lwd = lwd, :
## zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(x, ui, x, pmin(y + gap, ui), col = barcol, lwd = lwd, :
## zero-length arrow is of indeterminate angle and so skipped
aov_campanas <- aov(campanas$newUsers ~ campanas$ciudad)
summary(aov_campanas)
## Df Sum Sq Mean Sq F value Pr(>F)
## campanas$ciudad 5 458 91.55 19.54 <2e-16 ***
## Residuals 2063 9667 4.69
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is below .05, so we reject \(H_0\) in favor of \(H_1\): mean new-user acquisition differs significantly between at least two cities.
Tukey's test
(tukey <- TukeyHSD(aov_campanas))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = campanas$newUsers ~ campanas$ciudad)
##
## $`campanas$ciudad`
## diff lwr upr p adj
## AND-ALI -0.11071429 -1.63210203 1.4106735 0.9999472
## BCN-ALI 0.25220345 -0.95591267 1.4603196 0.9913591
## MAD-ALI 1.32989836 0.14701516 2.5127816 0.0171144
## NAC-ALI 0.73530539 -0.46041214 1.9310229 0.4960038
## VAL-ALI 0.06184669 -1.45188209 1.5755755 0.9999970
## BCN-AND 0.36291774 -0.66230521 1.3881407 0.9148461
## MAD-AND 1.44061265 0.44524829 2.4359770 0.0005424
## NAC-AND 0.84601968 -0.16456330 1.8566027 0.1608549
## VAL-AND 0.17256098 -1.19963156 1.5447535 0.9992243
## MAD-BCN 1.07769491 0.70935525 1.4460346 0.0000000
## NAC-BCN 0.48310194 0.07542332 0.8907806 0.0096071
## VAL-BCN -0.19035676 -1.20417936 0.8234658 0.9947229
## NAC-MAD -0.59459297 -0.91996963 -0.2692163 0.0000030
## VAL-MAD -1.26805167 -2.25166967 -0.2844337 0.0032983
## VAL-NAC -0.67345870 -1.67247428 0.3255569 0.3883007
An adjusted p-value below .05 indicates a significant difference between the two cities in that pair; a larger value means we cannot conclude that they differ.
In the plot, the intervals (lines) that do not contain 0 correspond to the pairs of cities with significant differences.
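Reading fifteen rows by eye is error-prone, so the significant pairs can be filtered programmatically. Since campanas is not available outside this analysis, the sketch below uses R's built-in warpbreaks data set as a stand-in; the same indexing would apply to the tukey object, under the assumption that its first element holds the comparison matrix.

```r
# Stand-in example on a built-in data set (not the campaign data).
fit <- aov(breaks ~ tension, data = warpbreaks)
tk  <- TukeyHSD(fit)

# The comparison matrix has columns diff, lwr, upr and "p adj".
pairs <- tk$tension

# Pairs with an adjusted p-value below 0.05.
significant <- pairs[pairs[, "p adj"] < 0.05, , drop = FALSE]
print(significant)

# Equivalent reading from the intervals: those that exclude 0.
excludes_zero <- pairs[, "lwr"] > 0 | pairs[, "upr"] < 0
print(excludes_zero)
```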
plot(tukey)
masters_aw_ds <- campanas %>%
dplyr::filter(nombre %in% c("ANALITICA WEB" , "DATA SCIENCE"))
masters_aw_ds <- masters_aw_ds %>%
mutate(ctr = adClicks / impressions)
masters_aw_ds %>%
group_by(nombre) %>%
summarise(avg_ctr = mean(ctr, na.rm = T))
## # A tibble: 2 × 2
## nombre avg_ctr
## <chr> <dbl>
## 1 ANALITICA WEB 0.12427219
## 2 DATA SCIENCE 0.07907015
| nombre | avg_ctr |
|---|---|
| ANALITICA WEB | 0.1242722 |
| DATA SCIENCE | 0.0790701 |
# Vectors
ctr_aw <- masters_aw_ds %>%
filter(nombre == "ANALITICA WEB") %>%
dplyr::select(ctr) %>% unlist()
ctr_ds <- masters_aw_ds %>%
filter(nombre == "DATA SCIENCE") %>%
dplyr::select(ctr) %>% unlist()
mean(ctr_aw, na.rm = T)
## [1] 0.1242722
mean(ctr_ds, na.rm = T)
## [1] 0.07907015
var(ctr_aw, na.rm = T)
## [1] 0.01689408
var(ctr_ds, na.rm = T)
## [1] 0.005366735
sd(ctr_aw, na.rm = T)
## [1] 0.1299772
sd(ctr_ds, na.rm = T)
## [1] 0.07325801
var.test(ctr_aw,ctr_ds)
##
## F test to compare two variances
##
## data: ctr_aw and ctr_ds
## F = 3.1479, num df = 313, denom df = 229, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 2.466729 3.998878
## sample estimates:
## ratio of variances
## 3.147926
The p-value is below 0.05, so we reject \(H_0\): the two variances are not equal (the variability of CTR differs between the two campaigns).
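The F statistic is simply the ratio of the two sample variances printed earlier, and the degrees of freedom reveal the group sizes. A short sketch using only those printed values (variable names are illustrative):

```r
# Sample variances printed above.
var_aw <- 0.01689408    # ANALITICA WEB
var_ds <- 0.005366735   # DATA SCIENCE

# F statistic: ratio of variances.
f_stat <- var_aw / var_ds

# Degrees of freedom from the output: n - 1 for each group.
df_aw <- 313   # so 314 observations
df_ds <- 229   # so 230 observations

# Two-sided p-value from the F distribution.
p_upper <- pf(f_stat, df_aw, df_ds, lower.tail = FALSE)
p_lower <- pf(f_stat, df_aw, df_ds)
p_value <- 2 * min(p_upper, p_lower)

print(f_stat)
print(p_value)
```

Note that the F test is itself sensitive to non-normality, so its result is best read together with the normality checks that follow.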
##
## Shapiro-Wilk normality test
##
## data: ctr_aw
## W = 0.69855, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: ctr_ds
## W = 0.70436, p-value < 2.2e-16
In both cases the p-value is below 0.05, so we reject the null hypothesis that CTR follows a normal distribution, for both campaigns.
For this reason we cannot rely on t.test(ctr_aw, ctr_ds) and use a non-parametric test instead.
wilcox.test(ctr_aw, ctr_ds)
##
## Wilcoxon rank sum test with continuity correction
##
## data: ctr_aw and ctr_ds
## W = 45142, p-value = 6.12e-07
## alternative hypothesis: true location shift is not equal to 0
Since the p-value is below 0.05, we reject the null hypothesis: the CTR distributions of the two populations (campaigns) differ.
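The decision path followed in this last section (check normality, then choose between a t-test and a rank-based test) can be wrapped in a small helper. This is only a sketch: compare_groups is a hypothetical function, the data are simulated skewed samples rather than the real CTR vectors, and pre-testing for normality is a common heuristic rather than a rule.

```r
# Simulated right-skewed samples standing in for two CTR vectors.
set.seed(42)
a <- rexp(200, rate = 8)
b <- rexp(150, rate = 12)

# Hypothetical helper: Welch t-test if both samples look normal, else Wilcoxon.
compare_groups <- function(a, b, alpha = 0.05) {
  looks_normal <- shapiro.test(a)$p.value >= alpha &&
                  shapiro.test(b)$p.value >= alpha
  if (looks_normal) t.test(a, b) else wilcox.test(a, b)
}

res <- compare_groups(a, b)
print(res$method)
print(res$p.value)
```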